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Abstract — In many practical situations we would like to es- 
timate the covariance matrix of a set of variables from an 
insufficient amount of data. More specifically, if we have a set 
of N independent, identically distributed measurements of an M 
dimensional random vector the maximum likelihood estimate is 
the sample covariance matrix. Here we consider the case where 
N < M such that this estimate is singular (non-invertible) 
and therefore fundamentally bad. We present a radically new 
approach to deal with this situation. Let X be the M x N data 
matrix, where the columns are the N independent realizations 
of the random vector with covariance matrix E. Without loss of 
generality, and for simplicity, we can assume that the random 
variables have zero mean. We would like to estimate E from 
X. Let K be the classical sample covariance matrix. Fix a 
parameter 1 < L < N and consider an ensemble of L x M 
random unitary matrices, {$}, having Haar probability measure 
(isotropically random). Pre- and post-multiply K by and by 
the conjugate transpose of $ respectively, to produce a non- 
singular L x L reduced dimension covariance estimate. A new 
estimate for E, denoted by covl(K), is obtained by a) projecting 
the reduced covariance estimate out (to M x M) through pre- 
and post-multiplication by the conjugate transpose of <£>, and 
by $ respectively, and b) taking the expectation over the unitary 
ensemble. Another new estimate (this time for E _1 ), invcov /.(.ft'), 
is obtained by a) inverting the reduced covariance estimate, 
b) projecting the inverse out (to M x M) through pre- and 
post-multiplication by the conjugate transpose of <E>, and by 
$ respectively, and c) taking the expectation over the unitary 
ensemble. We show that the estimate cov is equivalent to diagonal 
loading. Both estimates invcov and cov retain the original 
eigenvectors and make nonzero the formerly zero eigenvalues. 
We have a closed form analytical expression for invcov in terms 
of its eigenvector and eigenvalue decomposition. We motivate 
the use of invcov through applications to linear estimation, 
supervised learning, and high-resolution spectral estimation. We 
also compare the performance of the estimator invcov with 
respect to diagonal loading. 

Index Terms — Singular Covariance Matrices, Random Matri- 
ces, Limiting Distribution, Sensor Networks, Isotropically Ran- 
dom, Stiefel Manifold, Curse of Dimensionality 



I. Introduction 

The estimation of a covariance matrix from an insufficient 
amount of data is one of the most common multivariate 
problems in statistics, signal processing, and learning theory. 
Inexpensive sensors permit ever more measurements to be 
taken simultaneously. Thus the dimensions of feature vec- 
tors are growing. But typically the number of independent 
measurements of the feature vector are not increasing at a 
commensurate rate. Consequently, for many problems, the 
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sample covariance matrix is almost always singular (non- 
invertible). More precisely, given a set of independent multi- 
variate Gaussian feature vectors, the sample covariance matrix 
is a maximum likelihood estimate. When the number of feature 
vectors is smaller than their dimension then the estimate is 
singular, and the sample covariance is a fundamentally bad 
estimate in the sense that the maximum likelihood principle 
yields a non-unique estimate having infinite likelihood. The 
sample covariance finds linear relations among the random 
variables when there may be none. The estimates for the larger 
eigenvalues are typically too big, and the estimates for the 
small eigenvalues are typically too small. 

The conventional treatment of covariance singularity artifi- 
cially converts the singular sample covariance matrix into an 
invertible (positive-definite) covariance by the simple expedi- 
ent of adding a positive diagonal matrix, or more generally, by 
taking a linear combination of the sample covariance and an 
identity matrix. This procedure is variously called "diagonal 
loading" or "ridge regression" [12], [10]. The resulting covari- 
ance has the same eigenvectors as the sample covariance, and 
eigenvalues which are uniformly scaled and shifted versions 
of the sample covariance eigenvalues. The method of Ledoit 
and Wolf [7] automatically chooses the combining coefficients 
for diagonal loading. 

We propose a radically different alternative to diagonal load- 
ing which is based on an ensemble of dimensionality reducing 
random unitary matrices. The concept is that the unitary matrix 
multiplies the feature vectors to produce shortened feature vec- 
tors, having dimension significantly smaller than the number 
of feature vectors, which produce a statistically meaningful 
and invertible covariance estimate. The covariance estimate 
is used to compute an estimate for the ultimate quantity or 
quantities of interest. Finally this estimate is averaged over 
the ensemble of unitary matrices. We consider two versions 
of this scheme which we call cov and incov. We show that the 
estimate cov is equivalent to diagonal loading. Both estimates 
invcov and cov retain the original eigenvectors and make 
nonzero the formerly zero eigenvalues. We have a closed form 
analytical expression for invcov in terms of its eigenvector 
and eigenvalue decomposition. We motivate the use of invcov 
through applications to linear estimation, supervised learning, 
and high-resolution spectral estimation. We also compare the 
performance of the estimator invcov with respect to diagonal 
loading. 

Throughout the paper we will denote by A* the complex 
conjugate transpose of the matrix A. In will represent the 
N x N identity matrix. We let Tr be the non-normalized trace 
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for square matrices, denned by, 



N 



Tr(A) :=J2< 



where an are the diagonal elements of the N x N matrix A. 
We also let tr be the normalized trace, defined by tr(A) = 

II. New Approach to Handling Covariance 
Singularity 

We begin with a set of N independent identically distributed 
measurements of an M dimensional random vector where 
N < M. We introduce an ensemble of L x M random unitary 
matrices, such that L < N. The unitary matrix multiplies 
the feature vectors to produce a set of TV feature vectors 
of dimension L from which we obtain an invertible sample 
covariance matrix. The dimensionality reduction process is 
reversible (i.e., no information is thrown away) provided it 
is done for a sufficient multiplicity of independent unitary 
matrices. The key question is what to do with the ensemble 
of reduced dimension covariance estimates. 

A. Notation and sample covariance 

We are given a M x N data matrix, X, the columns of which 
comprise TV independent identically distributed realizations 
of a random vector. For convenience we assume that the 
random vector is zero-mean. We also will assume that the 
random vector is circularly-symmetric complex. The sample 
covariance is 



K 



1 

N' 



XX*. 



(1) 



We are interested in the case where N < M. Consequently 
the sample covariance is singular with rank equal to N. 

B. Dimensionality-reducing ensemble 

We introduce an ensemble of L x M random unitary matri- 
ces, $ where L < N and $$* = I L , where I L is the L x L 
identity matrix. The multiplication of the data matrix by the 
unitary matrix results in a data matrix of reduced dimension, 
L x N, which in turn produces a statistically meaningful 
sample covariance matrix provided that L is sufficiently small 
compared with N, 



N 



(2) 



We need to specify the distribution of the random unitary 
matrix. One possibility would be to use a random permutation 
matrix, the effect of which would be to discard all but L of 
the M components of the data vectors. Instead we utilize the 
Haar measure (sometimes called the "isotropically random" 
distribution [11]). A fundamental property of the Haar distri- 
bution is its invariance to multiplication of the random unitary 
matrix by an unrelated unitary matrix. Specifically, let p($) be 
the joint probability density for the components of the unitary 
matrix, and let be any unrelated MxM unitary matrix (i.e., 



either 8 is deterministic, or it is statistically independent of 
<]>). Then $ has Haar measure if and only if for all unitary 9 



=p($). 



(3) 



Compared with the random permutation matrix, the Haar 
measure is more flexible as it permits linear constraints to 
be imposed. 

C. Two nonsingular covariance estimates 

The generation of the ensemble of reduced-dimension co- 
variance estimates (2) is well-motivated. It is less obvious 
what to do with this ensemble. We have investigated two 
approaches: cov which yields directly a non-singular estimate 
for the M x M covariance matrix, and invcov which yields 
directly an estimate for the inverse MxM covariance matrix. 

1) cov: If we project the L x L covariance (2) out to a 
MxM covariance using the same random unitary matrix, 
and then take the expectation over the unitary ensemble, we 
obtain the following: 



cov L (K) = ($* . 



(4) 



cov L (K) 



(M 2 - 1)M 



This expectation can be evaluated in closed form (either by 
evaluating fourth moments, or by using Schur polynomials as 
shown later): 

\(ML-1)K+(M-L)Tt(K)Im ■ 

(5) 

Thus the procedure cov is equivalent to diagonal loading for 
a particular pair of loading parameters. The dimensionality 
parameter, L, determines the amount of diagonal loading. It 
is reasonable to re-scale the covariance expression (5) by 
the factor M/L because the dimensionality reduction yields 
shortened feature vectors whose energy is typically L/M times 
the energy of the original feature vectors. Note that we use the 
term energy to denote the \\x\\?, of a vector x. If the covariance 
is scaled in this manner then the trace of the sample covariance 
is preserved. 

Although it is both interesting and surprising that cov is 
equivalent to diagonal loading, we instead pursue an approach 
which is better motivated and which promises more com- 
pelling action. 

2) invcov; We first invert the L x L covariance (2) (which 
is invertible with probability one), project out to MxM using 
the same unitary matrix, and then take the expectation over the 
unitary ensemble to obtain the following: 

invcov L (if) (&K&*)- 1 ®). (6) 

The estimate invcov (as well as cov) preserves the eigen- 
vectors. In other words, if we perform the eigenvector and 
eigenvalue decomposition, 



K = UDU*, 



(7) 



where D is the MxM diagonal matrix, whose diagonals are 
the eigenvalues, ordered from largest to smallest, and U is 
the MxM unitary matrix of eigenvectors, then we prove (in 
Section IV) that 



invcov l(K) = Uinvcov l(D)U* . 



(8) 
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Therefore it is enough to compute invcovi(Z)). We also show 
that invcovi(D) is a diagonal matrix. Moreover, we show that 
if D = diag(Z?jv, Om-n) where D N = diag(d l7 . . . ,d N ) is 
the matrix with the non-zero entries. The matrix invcovi(D) 
is a diagonal matrix that can be decomposed as 

invcov L (D) = diag(Ai, . . . , \ n ,(iI M -n)- 

In other words all the zero-eigenvalues are transformed to 
a non-zero constant fi. In Section VI we prove an exact 
expression for the entries of invcovi(D). More specifically 
we prove that 



\i = Tr 



where the average is taken over the ensemble of all N x L 
Gaussian random matrices X with independent and complex 
entries with zero mean and unit variance. Proposition 1 (in 
Section VI) gives us an explicit formula for fi. On the other 
hand, using Lemma 1 (in the same Section) we prove that 

d_ 
dd k 



Trlog($*£>jv$) d<p. 



where 



/ 

Jn, 



THog($*Djv$) # 



can be explicitly computed using Theorem 1 in Section VI. 
Therefore, given D we obtain close form expressions for all 
the entries of the matrix invcovi(Z?) for every M, N and L. 

In Section VIII using Free Probability techniques we prove 
asymptotic formulas for the entries of invcovi(D) for large 
values of N. 

We focus the remainder of the paper on some potential 
applications of invcov, the derivation of its fundamental prop- 
erties, and how to compute it. 

III. Potential Applications of invcov 

Typically neither the covariance matrix nor its inverse is of 
direct interest. Rather some derived quantity is desired. Here 
we discuss three potential applications where invcov arises in 
a natural way. 

A. Design of a linear estimator from training data 

The problem is to design a minimum mean square linear 
estimator for a M x x 1 random vector x given an observation 
of a My x 1 random vector y. Exact statistics are not available; 
instead we have to work with statistics that are estimated from 
a set of training data. If the statistics were available then the 
optimum estimator would be (assuming that the vectors have 
zero-mean) 

x(y) = K xy K- x y, (9) 



where K y is the covariance matrix of vector y and K xy is the 
cross-co variance matrix of vectors x and y. In this case, the 
mean-square error is 



MMSE = E ((x(y) - x) (x(y) - x)*) 

— K x K X yKy Kyx- 



(10) 



For the design of the estimator we have training data compris- 
ing N independent joint realizations of x and y: X (M x x N) 
and Y (M y x N), where N < M y . 

We introduce an ensemble of L x M y isotropically random 
unitary matrices, $, where L < N. We reduce the dimen- 
sionality of the observed vector, y — > <f>y, and the training 
set, Y — > $F, and we estimate the relevant covariances as 
follows, 

K Xt *y = ±- T xv&; (ii) 



N 



K. 



<S>y 



1 

N 



$yy*$*. 



(12) 



We estimate x given the reduced observation <£>y by treating 
the covariance estimates (11) and (12) as if they were correct: 



x(®y) = K x ^ y K^y 

= xy*<p* ($ry*$*) _1 $y. 



(13) 



The mean-square error of this estimator conditioned on the 
random unitary matrix, $, is found by taking an expectation 
with respect to the training data, {X, Y}, the observation, y, 
(which is independent of the training data), and the true value 
of the unknown vector, x: 



E{[x($y)-x][x($y) 



(14) 



K x - K xv §* (QKy®*)- 1 $K yx ] [1 +E(tr ({V* V)- 1 ) )} 



where V is a N x L random matrix comprising independent 
CN(0,1) random variables. We note the asymptotic result, 



E 



(tr^vr 1 )) 



N,L^cc 



N-L 



(15) 



The mean-square error (14) is equal to the product of two 
terms: the mean-square error which results from performing 
estimation with a reduced observation vector and with exact 
statistics available, and a penalty term which account for 
the fact that exact statistics are not available. The first term 
typically decreases with increasing dimensionality parameter, 
L, which the second term increases with L. 

Instead of performing the estimation using one value of 
the dimensionality-reducing matrix, one can average the 
estimator (13) over the unitary ensemble: 

x(y) = E*(£($y)) 

= xy*E 4> ^*($yy*$*)- 1 $)y 

= XY* ■ invcov L (yy*) • y. (16) 

Jensen's inequality implies that the ensemble-averaged estima- 
tor (16) has better performance than the estimator (13) that is 
based on a single realization of $, 



E([E*(x($y))-x] • pMz($y))-z]*) 
£$(£( [x{$y) - x] [x(®y) - x 



< 



(17) 
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B. Supervised learning: Design of a quadratic classifier from 
training data 

The problem is to design a quadratic classifier from labeled 
training data. Given an observation of a M x 1 zero-mean 
complex Gaussian random vector, the classifier has to choose 
one of two hypotheses. Under hypothesis Hj, j = 0, 1, the 
observation is distributed as CN(0, Kj), j — 0,1. If the two 
covariance matrices were known the optimum classifier is a 
"likelihood ratio test" [15], 



Ho 



(18) 



where 7 is a threshold. Instead the covariances have to be 
estimated from two M x N matrices of labeled training 
data, Xj, j = 0, 1, each of which comprises N < M 
independent observations of the random vector under their 
respective hypotheses. 

We introduce an ensemble of L x M random unitary 
matrices, $, where L < N. For a given $ we reduce the 
dimension of both sets of training data and then estimate the 
reduced covariance matrices, 



K j = -zx j x;&,j = o,i. 



(19) 



For any $ we could implement a likelihood ratio test based 
on the estimated reduced covariances (19) and the reduced 
observation, $x. Alternatively we could base the hypothesis 
test on the expectation of the log-likelihood ratio with respect 
to the unitary ensemble, 

-x*E< s ,($* ({QX^l®*)- 1 - (SAb-X^**) -1 )*)^ 

Hi 

—x* (invcovi(XiXJ' )-mvcov L (X X*))x > 7. (20) 

H 

This classifer is of the "naive Bayes" type [16], in which 
statistical dependencies (in this case the individual likelihood 
ratios are not statistically dependent) are ignored in order to 
simplify the construction of the classifier. 



C. Capon MVDR spectral estimator 

The Capon MVDR (minimum variance distortionless re- 
sponse) spectral estimator estimates power as a function of 
angle-of-arrival given N independent realizations of a M- 
dimensional measurement vector from an array of sensors [17]. 
Let X be the M x N vector of measurements, M < N, and 
let the "steering vector", a, be the M dimensional unit vector 
which describes the wavefront at the array. The conventional 
power estimate, as a function of the steering vector, is 



a* Ka, 



(21) 



where K is the sample covariance matrix. The Capon MVDR 
power estimate is 

1 



^Capon — 



apon " a*K-ia 



(22) 



A justification for the Capon estimator is the following: one 
considers the estimated covariance matrix to be the sum of 
two terms, the first corresponding to power arriving from the 
direction that is specified by the steering vector, and the second 
corresponding to power arriving from all other directions, 



K = P ■ aa* + K othcl . 



(23) 



It can be shown that the Capon power estimate (22) is equal to 
the largest value of power P such that, in the decomposition 
(23), Mother is nonnegative definite [1]. In other words the de- 
composition (23) is nonunique, and the Capon power estimate 
is an upper bound on the possible value that the power can 
take. 

We deal with the singularity of the covariance matrix by 
introducing an ensemble of L x M unitary matrices, $. 
Since we are looking for power that arrives from a particular 
direction we constrain the unitary matrices to preserve the 
energy of the steering vector, i.e., a*$*<I>a = a*a = 1. This 
is readily done through a Householder unitary matrix, Q, such 
that 



Q = 



(24) 



where Aj_ is a M — 1 x M unitary matrix whose rows are 
orthogonal to a. We represent the ensemble $ as follows: 



$ = 



(25) 



where is a L — 1 x M — 1 isotropically random unitary 
matrix. We now use the constrained unitary matrix $ to reduce 
the dimensionality of the sample covariance matrix and the 
steering vector, we compute the Capon power estimate from 
the reduced quantities, and finally we average the power with 
respect to the unitary ensemble [2]: 



1 



a *$* ($#$*) 1 $ a 
= a*Ka- a*KA* L mvcov L - 1 (A ± KA* L )Aj_Ka 



(26) 



where 



mvcav L - 1 (A±KA* ± ) = E @ (q* {QA±KA* ± Q*)~ 1 e). 



(27) 



D. Distantly related research 

Our approach to handling covariance singularity is based 
on an ensemble of dimensionality-reducing random unitary 
matrices. Here we mention some other lines of research which 
also involve random dimensionality reduction. 

1 ) Johnson-Lindenstauss Lemma: In qualitative terms, the 
Johnson-Lindenstrauss Lemma [13] has the following impli- 
cation: the angle between two vectors of high dimension tends 
to be preserved accurately when the vectors are shortened 
through multiplication by a random unitary dimensionality- 
reducing matrix. 



5 



2) Compressive Sampling or Sensing: Compressive sam- 
pling or sensing permits the recovery of a sparsely-sampled 
data vector (for example, obtained by multiplying the original 
vector by a random dimensionality-reducing matrix), provided 
the original data vector can be linearly transformed to a 
domain in which it has sparse support [14]. Compressive 
sampling utilizes only one dimensionality-reducing matrix. 
In contrast our approach to handling covariance singularity 
utilizes an ensemble of random dimensionality-reducing ma- 
trices. 

IV. Derivation of Some Basic Properties of invcov 

In this Section we state and prove two basic and fundamen- 
tal properties of invcov^if). We perform the eigenvector and 
eigenvalue decomposition, 



K = UDU*, 



(28) 



where D is the M x M diagonal matrix, whose diagonals are 
the eigenvalues, ordered from largest to smallest, and U is the 
M x M unitary matrix of eigenvectors. 

A. Eigenvectors of sample covariance are preserved 

We substitute the eigenvalue decomposition (7) into the 
expression (6) for invcov l(K) to obtain the following: 



= E$\U(<f>Uy (<S>UDU*<5>*) 1 ($[/)[/' 
= C/E$ ($£><P*) _1 $^ U* 



= [/invcovL (D)U* 



(29) 



where we have used the fundamental definition of the isotropic 
distribution (3), i.e. that the product $£/ has the same dis- 
tribution as $. We intend to show that invcovi(D) is itself 
diagonal. We utilize the fact that a matrix A is diagonal if and 
only if, for all diagonal unitary matrices, f2, FLAW = A. Let 
il be a diagonal unitary matrix, we have 

f2 invcovi(D) fi* = 

= E <E ,(($fi*)*(($f7*)OL>0*($f7*)*) _1 ($f7*)) 

= 

= invcov L (D) (30) 

where we used the fact that $ST has the same distribution as 
$, and that QDQ* = D. Therefore we have established that 
the final expression in (29) is the eigenvector/eigenvalue de- 
composition of invcov^if), for which the eigenvector matrix 
is U and the diagonal matrix of eigenvalues is invcovi(D). 
Hence, we need only consider applying invcov to diagonal 
matrices. 

B. The zero-eigenvalues of the sample covariance are con- 
verted to equal positive values 

When the rank of the covariance matrix is equal to N < M, 
the eigenvalue matrix of K has the form 



D 



D N 






Om- 



N 



We want to establish that the last M — N eigenvalues of 
invcovL(if) are equal. To that end we introduce a unitary 
matrix, 5, 

" In 
P M -n 



(32) 



where Pm-n is an arbitrary M — N x M — N permutation 
matrix. We now pre- and post-multiply invcovi(D) by S and 
S* respectively: it will be shown that this does not change the 
diagonal matrix, so consequently the last M — N eigenvalues 
are equal. We have 

S invcov L (L>) E* = E E$ (V (^D**)- 1 *) E* 

= e$ (V r 1 ^) 

= invcovL(-D), (33) 

where we used the fact that $S* has the same distribution as 
$, and that EDE* = D. 

V. Functional Equation 

In this Section we will prove a functional equation for the 
inverse covariance estimate invcov l(K)- 

Let K be an M x M sample covariance matrix K of rank 
N. Since K is positive definite there exists U an M x M 
unitary and D an M x M diagonal matrix of rank N such 
that K = UDU*. Fix L < N. We would like to compute, 

invcov L (^) = E($*($if$*)" 1 $) (34) 

where $ is an L x M unitary matrix and the average is 
taken with respect to the isotropic measure. Let Z be an 
L x M Gaussian random matrix with complex, independent 
and identically distributed entries with zero mean and variance 
1. It is a well known result in random matrix theory (see [6]) 
that we can decompose Z — C<f> where C is an L x L positive 
definite and invertible matrix (with probability one). Hence, 
Z*{ZKZ*)~ 1 Z = Therefore, 

mvcov L (K) = E^Z*(ZKZ*)- 1 Zy (35) 

Moreover, as shown in the previous Section 

invcov L (if) = \m(z*(ZDZ*)- x Z}U*. (36) 

Therefore it is enough to compute E(Z* (ZDZ*)~ 1 Z). De- 
compose Z as Z = [X, Y] where X is L x N and Y is 
Lx(M-N). Now performing the block matrix multiplications 
and taking the expectation we obtain that invcov L (D) is equal 
to 



E(X*(XD N X*)- 1 X) 






E(Y*(XD N X*)- 1 Y) 



(37) 



(31) 



where D — dia,g(D N ,0 M _ N ) and D N an TV x N diagonal 
matrix of full rank. 

Let us first focus on the NxN matrix E(X*(XD N X*)~ 1 X), 
denote this matrix by 

A L (D N ) :=E(X*(XD N X*)- 1 X). (38) 
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Let W be the matrix W = XD^ 2 . Then 



A L (D N ) = E(X*(XD N X*)- L X) 

= D N 1/2 E(W*(WW*)~ 1 W)D 



(39) 



-1/2 



N 



and 



MW*(WW*)-^W) = lim E W*(jIl + WW*) W 

7— s-o 



I N - lim E 

7^0 



[i N + -w*w)~ 

1 



Therefore it is enough to compute 



lim E 

7^0 



which is equal to 



lim D N 1/2 E 
7— s-o 



(V 



1 

1 



-x*x 



D 



-1/2 



N 



Let us decompose X* X = QD x 9l* where f2 is an Af x TV 
unitary matrix and D x is an AT x N diagonal matrix of 
rank L. Then D x = diag(D , Qn-l)- It is a straightforward 
calculation to see that 



D N 1/2 E 



X*X 



D 



-1/2 



it is equal to 

D N 1/2 E 



-1/2 



AT 



Doing the block matrix decomposition 



An A 12 
A21 A 2 2 



where An is L x L and A 2 2 is (N — L) x (N — L) we see 
that 



A21 

X Y 
Z W 



A, 



where X = (A n + ^D -A 12 A 22 1 A 21 ) , Y = -XA 12 A 22 \ 
Z = -A 22 A 21 X and W = A 22 + A 22 A 21 XA 12 A 22 . Since 
lim 7 ^ X = we see that 



/ 1 \ _1 

lim [n*DZ 1 n + -D x ) 

7->0 V JV 7 / 




A; 



7-5-0 V 7 

Putting all the pieces together we obtain that 





A L (D N ) = D N 1 -D N 1 -E fi 



A; 



ft* -Z?" 1 (40) 



where fi is an isotropically distributed N x N unitary matrix 
and 

An A 12 
A21 A 22 



Let us decompose the unitary matrix ft as ft — Cl 2 ], where 
51i is x L and VL 2 is N x (N-L) matrix. Then f^fii = 7 L 



and ^2^2 = In-l isotropically distributed unitaries. It is an 
easy calculation to see that 



E ft 





A 22 x 



ft* 



E^ft*^ 1 ^) 'O^ 
An-l^ 1 ). (41) 



Therefore, using equation (40) and equation (41) we found the 
following functional equation 



D N A L {D N ) + D^An^D^) = I N . 



(42) 



Remark 1: Here we list a few results on Al(Dn). 

1) If TV = L then D N A N (D N ) = I N and therefore 
A N {D N ) =D^. 

2) It is not difficult to see, and well known result on random 
matrices, see [6], that Al(In) = j^In- Therefore, 

A L {aI N ) = ^I N . (43) 

Hence, 

aI N A L (aI N ) + a~ 1 I N A N - L (a~ 1 I N ) = I N 

which agrees with equation (42). 

3) Tr{D N A L (D N )) = Tr(E(£) Ar X*(X J D^X*)- 1 X)) - 
L where in the last equality we used the trace property. 

As we saw in Equation (37) the other important term in 

invcovi(-D) is 

E(Y*(XD N X*)- 1 Y). 
Let us define [i > as 



Tr 



EUXDnX*)- 1 ) 



(44) 



Since X and Y are Gaussian independent random matrices it 
is clear that 



E(Y* (XD N X*)~ 1 Y) = TtIeUXDnX*)- 1 ) 



im-n 



M-Ni 



where Im-n is the identity matrix of dimension M — N. 
Putting all the pieces together we see that the estimate 
invcovi(D) is equal to 



invcov L (Z)) = diag(A L (D N ),/j,I M _ N ). 



VI. FCOV EXACT FORMULA 



(45) 



In this Section we will prove an exact and close form expres- 
sion for the entries of invcovL(D). We will treat separately 
the entries of Al(Djs[) and the constant term [i. As a matter 
of fact the analysis developed in this Section will allow us to 
obtain close form expressions for more general averages. 

Recall that we say a matrix A is said to be normal if it 
commutes with its conjugate transpose A A* = A* A. Given a 
normal matrix A and / : C — > C a continuous function we 
can always define f(A) using functional calculus. Being more 
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precise we know by the spectral Theorem that exist U unitary 
and D = diag(c?i, . . . , d^) such that 



We then define 



A = UDU*. 



f(A) = UD f U* 



where Df = diag(/(rfi), . . . , /(djv))- In particular, let D be 
as before and let / be a continuous function, we will obtain 
an exact expression for 

fcovi(D) :=e($*/($D$*)*) (46) 

where <3? is an L x N unitary isotropically random. Note that 
our covariance estimate invcovL(D) is a particular case of the 
last expression when f(x) = x~ x . 

Let VL L . N = {$ e C NxL : $*$ = I L } be the Stiefel 
manifold with the isotropic measure d<p. By equation (18) in 
[3] we know that 

s x {*'D N *W= a ^ r a f L) (47) 

where s\ is the Schur polynomial associated with the partition 
A. The latter are explicitly defined for any N x N matrix A 
in terms of the eigenvalues ai, . . . , ajv as 

s x (A) = s x (a u ...,a N )= \ N '^~\ (48) 

with A being a partition, i.e. a non-increasing sequence of 
non-negative integers Xj. For an introduction to the theory of 
symmetric functions and properties of the Schur polynomials 
see [4] and [5]. 

Denote by (n — k, l k ) the partition (n — k, 1, 1, . . . , 1) with k 
ones. One of the properties of the Schur polynomials is that 



N-l 

fe=0 



Tr(A n ) 

Using equation (47) and (49) we see that 
J Tr((&D N *) 



(49) 



L,N 



is equal to 



L-l 

E 

k=Q 



(-1) 



TL — k,l k ) (II) 
s (n-kA k )(lN) 



(50) 



The constant s ( „_ M *)(/ p ) = fci^Jfc-^^-fc-ijin see 
Therefore, 



!(p-fc-l)!(n-fc-l)!: 
«(n-fc,l*)(JL) = (n + L- (fc + 1))! (JV- (fc + 1))! 

*(n-fc,i*)( J ") ~ (« + ^ - (k + 1))! ' (L - (k + 1))! 



(51) 

For each p > consider the operator defined in x n by 
j(p)(x n ) — ( n+ ^ — ( n+p ) ■ This extends linearly and continu- 
ously to a well defined linear operator 1^ : C[0, r] — > C[0, r] 
where C[0, r] are the continuous functions in the interval [0, r]. 
Now we are ready to state the main Theorem of this Section. 



Theorem 1: Let be an N x N diagonal matrix of rank 
7Y. For any continuous (complex or real valued) function / e 



is equal to 



L-l 

E 

fc=0 



li:(/(**£)^$)) dcj> 



(N-(k + l))\ det(G fc ) 



(52) 



(L-(fc + l))! det(A(D Af )) 



where A(Dat) is the Vandermonde matrix associated to Dn 
and Gfe is the matrix defined by replacing the (k + 1) row of 
the Vandermonde matrix A(D N ), {df ~^ +1 ^}™ =1 , by the row 



{l^ N - L \x^-^f(x))\ x=di }\ 

Proof: By linearity and continuity (polynomials are dense 
in the set of continuous functions) it is enough to prove (52) 
in the case f(x) = x n . By (50) and (51) we know that 



f Tr(($*L»Ar$) rl ) d<j> 



is equal to 



L-l 



E(-d 



k JN,L) 



s (n-kA k ){DN) 



(53) 



k=0 



where 



(JV.L) = (JV -(<; + !))! (n + £-(fc + l))! 
Cfe ■ (L-(fc + l))! ' (n + iV- (fc + 1))!' 

By definition of the Schur polynomials (see [4]) 

det(S fc ) 



s (n-fc,l fc )( £) A r ) 



dct(A(D N )) 



where the i^-column of the matrix Sk is 



/ d?- 1+n - k \ 

d N-2+l 



JV-(fc+l)+l 



AT-(fe+2) 



.JV-(JV-l) 
1 



Doing fc row transpositions on the rows of the matrix Sk we 
obtain a new matrix Sk where the i'^-column of this new 
matrix is 
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df~ 3 



N-k 



N+n-(k+l) 



d. 



AT-(fc+2) 



V 



1 



Proof: Let 1 < L < N and let us consider f(x)=x 1 
then for each < fc < L — 2 we see that 

JV-(fc+2) 

r(JV-i)/ L-(fe+l) -hi _ fl 1 

1 ; l«=* (L-k-l)...(N-k-2) 

since this is a multiple of the (k+2) th row of the Vandermonde 
matrix A(D N ) we see that det(G fe ) = for all < k < L-2. 



log(di) 



/ 



Note that this matrix Sk is identical to A(Djv) except of the 
(fc + 1) row which was replaced by the row {df +™-(' c + 1 ) j.a^ 
instead of {df ~ (fc+1) }£i as in A(D W ). Therefore, 



It is not difficult to see that for k = L — 1 

,N-(L+1) 

j(N-L), -1)1 = Oj 

Therefore using Theorem 1 we see that 

det(G) 



/ Tr((<f>* D N $)- 1 )d<j> = (JV-L) • 



Tr 



L-l 



is equal to 

fe=0 

Now and using the fact that 



(n,l) det(5 fc ) 



det(A(£»^))' 



(54) 



/-(JV-^^n+i-Cfe+l)^ (n + £ - (fc + 1))! jV +ra -(fc+l) 

( j| -= d . (n + iV- (fc + 1))! 1 



we obtain that 



(n + L- (fc + 1))! 



(n + JV- (fc + 1))! 
Putting all the pieces together we obtain that 



det(A(^)) (5?) 

where G is constructed by replacing the L th row of the 
Vandermonde matrix by the row 

(dr^iog^),-.-,^"^^^)). 

Now we will prove the second part of the Proposition. Given 
X an N x L random matrix as in the hypothesis we can 
decompose X as 

X = <$>C 

where $ is an isotropically random N x L unitary and G is 
a positive definite L x L independent from $ (see Section 
2.1.5 from [6]). The matrix G is invertible with probability 1. 
Hence, 

(X*D N X)- 1 = (C&DnQC)- 1 = C-^fc'Div*)- 1 ^ 1 . 
Therefore, 



/ 

Jn, 



Tr(($*£>jv$) n ), 



E 



T^pTAvX)- 1 ) 



is equal to 



Tr(C- 1 ($*D N $)- 1 C- 1 ) 
Tr(G _2 ($*£)jv^) _1 ) 



L-l 

£ 

fe=0 



(JV-(fc + l))! det(G fe ) 



(L -(& + !))! det(A(Djv)) " 



where in the second equality we used the trace property. Recall 
from random matrix theory that if A, B are independent n x n 
random matrices then 



The next Proposition gives us, in particular, an exact and close 
form expression for the term \i in equation (44) discussed 
previously. 

Proposition 1: Let Dm be an N x TV diagonal matrix of 
rank N. Let 1 < L < N and X be an N x L Gaussian 
random matrix with independent and identically distributed 
entries with zero mean and variance 1. Then 



E 



Tr(AB) 



n 



Tr(A) E Tr(B) 



Since the matrix G is independent with respect to $ and 
we conclude that 



E 



Tt^DnX)- 1 ) 



is equal to 



/ 



and 



1 



E 



Tr(G" 



E 



Tr(($* J D Ar $)" 



/j, :-- 



E 



Ti{(X*D N X)- 1 ) 



det(G) 



Let D N = I N then (X*D N X)- 1 = {X*X)~ 1 = G~ 2 and it 
is known (see Lemma 2.10 [6]) that 

L 



(56) 



E 



det(A(D N )) 

where G is the matrix constructed by replacing the L th row 
of the Vandermonde matrix by the row 



Tr(G~ 



N — L 

Putting all the pieces together we see that 



/j, := 



E 



(4 



N-(L+1) 



log(di ),..., d% 



^logldjv))- 



Tt^DnX)- 1 ) 



det(G) 
det(A(D JV )) 



(58) 



finishing the proof. 
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A. Application of the Main Results to cov l(D) 

Using Theorem 1 we can compute several averages over the 
Stiefel manifold. As an application let us compute 

cov L (D)= f $($*D*)#*d0 = diag(ti,...,tM). 

(59) 

First let us note that the same proof in Lemma 2 gives us the 
following result. 

Lemma 1: Let / be a differentiable function in the interval 

[d m in,d max ] where d min = rnin{dj} and d max = max{di}. 
Then the following formula holds, 

_d_ 
dd k 



B. Application of the Main Results to invcov l{D) 

As we mention in the introduction and subsequent Sec- 
tions the problem we are mainly interested is to compute 
invcovL(D) when D is an M x M diagonal matrix of 
rank N and L < N. Let D = diag(£>jv, Om-n) where 
Dn — diag(di, . . . , dw). As we previously saw invcovi(D) 
is a diagonal matrix that can be decomposed as 



invcov L (D) = di&g(A L (D N ), [iIm-n) 



f Tr(f(<f>*D N $)) d<\> (60) 
Jn L . N v ' 



where 

Al(D n ) 

and 



$($*Djv$) _1 $* 



diag(Ai, . . . , Ajv) 



is equal to 



/ 



$f'(<f>*D N 



where Akk means the (fc, k) entry of the matrix A. 

Let D an M x M diagonal matrix of rank N and let L < N. 
By equation (59) and Lemma 1 we know that 



Since 



/ 

Jn, 



L L + l 

T 

L L-l 

M ' M^T S(1,1) 



where 



and 



Then 

9S(2,0)(D) 



M 



i—l i<j 



S (M)(£>) =Ys d idr 

i<j 



Tr(L>) + d k and 



ddk 

Therefore, t k is equal to 
L + l 



1 L 

2 ' M 

hence 



dd k 



l l (d k + Tr(D))-^ i {Tr(D)-d k ) 



Tr(D) - d k 



M 



tk = 



L 

M 



M -L , N ML — 1 , 



M 2 - 1 



M 2 - 1 



Therefore, 

cov L (D) 



M 



M — L , N ML - 1 

Tr(£») • / M + — • £> 



Af 2 - 1 



M 2 - 1 



/i = Tr 



E((X*^ A rX)- 1 ) 



where the average is taken over the ensemble of all N x L 
Gaussian random matrices X with independent and complex 
entries with zero mean and unit variance. 



Trlog($*£>iv*) d(j). 



Using Lemma 1 we see that 

^ k dd k 

where 

Trlog($*£>jv$)# 



/ 

Jn, 



can be explicitly computed using Theorem 1. On the other 
hand, Proposition 1 gives us an explicit formula for fi. There- 
fore, given D we obtained close form expressions for all the 
entries of the matrix invcov l{D) for every M, N and L. 

C. Small Dimension Example 

In this subsection we will compute invcovi(D) for a small 
dimensional example. Let M — 4, N = 2 and L = 1 and 
D = diag(di, d 2 , 0, 0). Hence following our previous notation 
D 2 = diag(rfi, d 2 ). Let us first consider the case d\ = d 2 = d. 
In this case using equation (43) and equation (77) we see that 

invcovi(L>) = diag^d- 1 /2, d- 1 /2, d- 1 , d- 1 ). 

The more interesting case is d\ ^ d 2 . Applying Theorem 1 
and Lemma 1 we see that 



invcovi (D) = diag( Ai ,\ 2 ,p,,p) 



where 



A, 



d 



ddi Ja 



Trlog($*L> 2 $)d0 

d I det(H) \ 
0d^det(A(D 2 )) ) 



where 



and 



A(D 2 ) = 



d\ d 2 
1 1 



H 



di logdi 
1 



di d 2 log d 2 - d 2 
1 
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Doing the calculations we see that 

d 2 log d 2 - d 2 log d\+d\ — d 2 



and 



Ai 



A 2 = 



(di - d 2 ) 2 
di log di - di log d 2 + d 2 - d\ 



(di - d 2 ) 2 
On the other hand using Proposition 1 



(61) 
(62) 



/i = Tr 



E((XD 2 X*)- L ) 



det(G) 
det(A(£> 2 )) 



where 



G 



Therefore, 



/i 



log di log d 2 

1 1 



log di - log d 2 



(63) 



di - d 2 

To summarize, computation of invcov is facilitated by first 
taking the eigenvector and eigenvalue decomposition of the 
sample co variance matrix and then applying invcov to the 
diagonal eigenvalue matrix. At that point there are three 
alternatives: a straight Monte Carlo expectation, the asymptotic 
expressions (76) and (83), or the closed form analytical 
expressions according to Proposition 1 and Theorem 1. 

VII. Preliminaries on the Limit of Large Random 
Matrices 

A random matrix is a measurable map X, defined on some 
probability space (fi, P) and which takes values in a matrix 
algebra, M„(C) say. In other words, X is a matrix whose 
entries are (complex) random variables on (fi, P). One often 
times identifies X with the probability measure X(P) it 
induces on M„(C) and forgets about the underlying space 
(Q,P). Random matrices appear in a variety of mathematical 
fields and in physics too. For a more complete and detailed 
discussion of random matrix limits and free probability see 
[21], [20], [18], [19] and [6]. Here we will review only some 
key points. 

Let us consider a sequence {A^j-NgN of self-adjoint NxN 
random matrices A N . In which sense can we talk about the 
limit of these matrices? It is evident that such a limit does not 
exist as an oo x oo matrix and there is no convergence in the 
usual topologies. What converges and survives in the limit are 
the moments of the random matrices. Let A — (a^ 
where the entries are random variables on some probability 
space £1 equipped with a probability measure P. Therefore, 

1 N f 

E(tv N (A N )) -=jrJ2 a u (w)dP(u;) (64) 
i=i 

and we can talk about the fc-th moment E(trN(-4^)) of our 
random matrix An, and it is well known that for nice random 
matrix ensembles these moments converge for N — > oo. So 
let us denote by at the limit of the fc-th moment, 



a k := lim E(tr N (A k N )). 

N— >ca 



(65) 



Thus we can say that the limit consists exactly of the collection 
of all these numbers a^. However, instead of talking about a 
collection of numbers we prefer to identify these numbers as 
moments of some random variable A. Now we can say that our 
random matrices An converge to a variable A in distribution 
(which just means that the moments of A N converge to the 
moments of A). We will denote this by An — > A. 

One should note that for a self-adjoint NxN matrix 
A = A*, the collection of moments corresponds also to a 
probability measure \ia on the real line, determined by 



Jr 



tr N (A") = / t«dvA(t). 



(66) 



This measure is given by the eigenvalue distribution of A, i.e. 
it puts mass on each of the eigenvalues of A (counted with 
multiplicity): 



Ha 



1 N 



(67) 



where Ai, . . . , Ajv are the eigenvalues of A. In the same way, 
for a random matrix A, \ia is given by the averaged eigenvalue 
distribution of A. Thus, moments of random matrices with 
respect to the averaged trace contain exactly that type of 
information in which one is usually interested when dealing 
with random matrices. 

Example 1: Let us consider the basic example of random 
matrix theory. Let G N be an N x N self-adjoint random 
matrix whose upper-triangular entries are independent zero- 
mean random variables with variance and fourth moments 
of order 0(^2 ). Then the famous Theorem of Wigner can be 
stated in our language as 

Gn — > s, where s is a semicircular random variable, 

(68) 

where semicircular just means that the measure /i s is given 
by the semicircular distribution (or, equivalently, the even 
moments of the variable s are given by the Catalan numbers). 

Example 2: Another important example in random matrix 
theory is the Wishart ensemble. Let X be an TV x K random 
matrix whose entries are independent zero-mean random vari- 
ables with variance and fourth moments of order O(jp). 
As K, N -> 00 with § -> 0, 



X*X 



(69) 



where xp is a random variable with the Marcenko-Pastur law 
Hp with parameter /?. 

The empirical cumulative eigenvalue distribution function 
of an N x N self-adjoint random matrix A is defined by the 
random function 



#{k : A fc < x} 
N 



where are the (random) eigenvalues of A(uj) for each 
realization u. For each u this function determines a probability 
measure /Ujv(w) supported on the real line. These measures 



11 



{hn((jj)}u, define a Borel measure /jjv in the following way. 
Let B C R be a Borel subset then 



A new and crucial concept, however, appears if we go over 
from the case of one variable to the case of more variables. 

Definition 1: Consider N x N random matrices 



Affl and variables A\,...,A m (living in some 



abstract algebra A equipped with a state ip). We say that 

O^am ■ • ■ , A^ N ') — > (Ax, . . . , A m ) 
in distribution if and only if 

lim eUxJa% i) ■■■A% k) \\=p{A^---A^) (70) 
for all choices of k, 1 < ii, . . . ,i k < m. 

The Ax, ... , A m arising in the limit of random matrices are 
a priori abstract elements in some algebra A, but it is good to 
know that in many cases they can also be concretely realized 
by some kind of creation and annihilation operators on a full 
Fock space. Indeed, free probability theory was introduced by 
Voiculescu for investigating the structure of special operator 
algebras generated by these type of operators. In the beginning, 
free probability had nothing to do with random matrices. 

Example 3: Let us now consider the example of two in- 
dependent Gaussian random matrices G^,G^ (i.e., each of 
them is a self-adjoint Gaussian random matrix and all entries 



of G$ are independent from all entries of G y ^' ). Then one 
knows that all joint moments converge, and we can say that 



(2) 



) -> {sx,s 2 ), 



(71) 



where Wigner's Theorem tells us that both Sx and s 2 are 
semicircular. The question is: What is the relation between 
Sx and s 2 ? Does the independence between G$ and G$ 
survive in some form also in the limit? The answer is yes and 
is provided by a basic theorem of Voiculescu which says that 
Sx and s 2 are free. For a formal definition of freeness and 
more results in free probability see [21], [20], [18] and [19]. 

VIII. Asymptotic results 

In practical application when the values of M and N are too 
large the expressions of the previous Section could be difficult 
to evaluate and we need simpler mathematical formulas. It is 
for this reason that in this Section we will provide asymptotic 
results for the entries of invcov l(D). As we previously saw 
invcovi(D) is equal to 









K(Y*(XD N X*)- 1 Y) 



where D = diag(Z?Ar, Om-jv) and Dm an N x N di- 
agonal matrix of full rank. In this Section we will get 
asymptotic results for both terms E{X*{XD N X*)~ 1 X) and 
E{Y*{XD N X*)~ 1 Y) as N,L -> oo with Hindoo f = /?. 



Recall that \i > was defined as 



/j = Tr EUXDnX*)- 1 ) 



(72) 



As we already observed, if X and Y are Gaussian independent 
random matrices it is clear that 



E(y*(JD Jv x , )- 1 y) 



EUXDnX*)- 1 ) 



Tr 
I^Im-n 



l M-N 



Let us introduce the ^-transform of an x N random matrix 
A as, 



N 



Tr 



Edl + jA)- 1 ) 



for 7 > 0. 



Analogously, given v a probability measure on 
define the 77-transform of v as 



we can 



It is not difficult to see that 

lim i] u (-f) 

and 



7—^00 



K{o}) 



lim 7^(7) = t 1 dv{t). 



Consider D = (D N ) Ne fi a collection of diagonal N x N 
diagonal matrices such that D^ converge in distribution to a 
probability measure v, i.e. for all k > 



lim 

N^oo N 



^E(lV(^)) 



t k dv{t). 



Let (A^v)jV£N be a sequence of L x N complex Gaus- 
sian random matrices with standard i.i.d. entries. Then by 
Theorem 2.39 in [6] we know that j^X N D N X N converges 
in distribution (moreover, almost surely) as as N, L — > 00 
with limjv^oo ^ = (3 to a probability distribution whose 77- 
transform satisfies 



(73) 



where r\ v is the ry-transform of the limit in distribution of D = 
(D N ) NeN . Note that lim 7 _ >00 77(7) = and lim^oo 777(7) = 
7*. Therefore, taking limit as 7 goes to infinity on both sides 
of equation (73) we get 

(3-1 
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(74) 



since by definition and the convergence in distribution for N 
sufficiently large we have that 

1 



and 



The symbol 



-A Tr 
N-L 



E(l + ^Dn)- 1 



(75) 



N (3 ' 

denotes that equality holds in the limit. 



Using equation (74) and equation (75) we obtain 

N-L 



1 



N 

V 



1 



N 



N ^ 1 + nd k 



(76) 
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where Dm = diag(c?i, . . . , djv)- Note that this equation 
implicitly and uniquely defines \i > 0. 

Remark 2: 1) If N = L then /x = oo (which is not 
surprising). 

2) If d\ — ... = d N — a then /j = 
a - 1 TrE((XX*)- 1 ). Using Lemma 2.10 from [6] we 

know that TrE((XX*)- 1 ) = Therefore for this 
case 

" = (7?) 

which agrees exactly (without the approximation) with 
equation (76). 

Now we will compute the asymptotics of the other, and 
more complicated, term 

A L (D N ) = EiX^XDNX*)-^) = diag(Ai, . . . , X N ). 



Lemma 2: Let / be a differentiable function on the interval 



\d 



rmn ? "'m ax _ 



where d r< 



min{di\ and d n 



max{dj}. 



Then the following formula holds 

d 



dd, 



-E 



Ti(f(XDX*)) 



E 



X*f'(XDX*)X 



J kk 



(78) 



Proof: It is enough to prove this Lemma for the case / is 
a polynomial and by linearity it is enough to prove it for the 
case f(x) = x n . Let E k be the N x N matrix whose entries 
are all zero except the entry Ekk which is equal to 1 . Note that 

g^(D) - E k . If f(x) = x n then f(XDX*) = (XDX*) n 
and for each k it is easy to see that 

n-l 



d 



dd k 
Therefore, 



(XDX*) n = {XDX*yXE k X* {XDX*) n - {i+1) . 



j=0 



TT {w^ XDX *^j = nTr ( XE k x *( XDX *) n ~ 1 ) 
= Ti(E k X*f'{XDX*)XE^ 

= (x^nxDxnx)^ 



where we use the trace property and the fact that E\ — E k . 
Taking expectation on both sides we finish the proof. ■ 

As a corollary we obtain. 

Corollary 1: If 

A L (D N )=E(X*(XD N X*)- 1 X)=dmg(\ 1 ,... 1 \ N ) 



Tr \og(XD N X*) 



Before continuing let us define the Shannon transform of 
a probability distribution. Given v a probability distribution 
supported in K we define 

t?„(7) = / Iog(l+7*)di/. 
Jr 

It is easy to see that lim^oo ^^(7) — log(7) = J* R log(t)dv. 



Consider, as before, D = (D^Nefi a collection of diag- 
onal N x N diagonal matrices such that Dn converge in 
distribution to a probability measure v. Let {X N ) NeN be a 
sequence of L x N complex Gaussian random matrices with 
standard i.i.d. entries. Then by equation 2.167 in [6] we know 
that ^X N D N X N converges in distribution (moreover, almost 
surely) as as N, L — > 00 with lim w ^oo 77 = P to a probability 
distribution rip whose Shannon-transform 1) satisfies 

0(7) - PMmW) - Iog»?(7) + vd) - 1. (79) 

Subtracting log (7) on both sides and taking the limit 7 — »■ 00 
we obtain 



log(t) dvp = pd v (jj,) - log(/x) - 1. 



For N and L sufficiently large 



log{t)dvp= lim yE\Tilog(yX N D N X N ) 

L—KX) Li I \L / 

« ^(E[Tr logp^Dj^)] - Llog(L)) 



(80) 



(81) 



and 



/3 lim — E 

JV-J-OO N 



Tr log(l + nD N ] 



(82) 



AT 



iV 



^iog(i + Mfc)- 



Hence, for JV sufficiently large 



E 



AT 

Tr log(X N D N X* N )\ « J] log(l + a*4) + L log (— ) . 



fc=i 



Taking partial derivatives on both sides with respect to d k we 
obtain 



N 



L d/i 



dd k 1 + itdj 1 + ,ud fe /x <9d fe ' 



(83) 



IX. Simulations 

In this Section we present some of the simulations per- 
formed. Let S be a 100 x 100 true covariance matrix with 
Toeplitz with entries Sjj = exp(— — j\). Assume we 
take N ~ 50 measurements and we want recover S to the 
best of our knowledge. After performing the measurements 
we construct the sample covariance matrix K x . In Figure 
1 and 2 we can see the eigenvalues of the true matrix, 
the eigenvalues of the sample covariance matrix (raw data) 
and the eigenvalues of the new matrix after performing the 
randomization to the sample covariance for different values 
of L. In other words we are comparing the eigenvalues of 
S, K x and of invcovi(if x ) _1 for different values of L. 
Recall that invcov^iCj;) is an estimate for and therefore 
invcov^-fCr)" 1 is an estimate for S. We also performed the 
same experiment for the case that the true covariance matrix 
is a multiple of the identity. This corresponds to the case in 
which the sensors are uncorrelated. See Figure 3 and 4. 

Let A be an N x N matrix we define the Frobenius norm 
as ||A|| 2 = ^jTr(yl*A). In Figure 5, 6 and 7 we compare the 
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Conventional vs invcov method with data 100x50 
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Conventional vs invcov method with data 100x50 



— ■- 1 



20 40 60 SO 100 120 
True Eigenvalues 



20 40 60 SO 100 
invcov estimation of Eigenvalues L=20; normalized i UN 



120 




20 40 60 SO 100 
invcov estimation of Eigenvalues L=1 5; normalized s UN 



120 




20 40 60 80 100 
invcov estimation of Eigenvalues L=10; normalized i UN 



120 



Fig. 1. Comparison of the eigenvalues of the true covariance matrix and 
sample covariance matrix vs. invcov estimate. The conventional eigenvalue 
estimate gives 50 eigenvalues that are precisely zero, which is not correct. 
Our invcov estimate is much closer to the actual eigenvalue distribution. 



performance of our estimator with the more standard estimator 
of Ledoit and Wolf [7]. The experiment was performed with a 
60 x 60 true covariance matrix E with Toeplitz entries Ejj = 
exp(— — j\) and N — 30. We compute the Frobenius norm 
|| S — invcov^ifa;) -1 !^ for the different values of L and we 
compare it with ||E - K LW \\ 2 . We use j3 = 10,50 and 100. 
We can see that our method outperform Ledoit and Wolf for 
a big range of the parameter L. We also note that in the three 
cases L = 20 is the optimum. 



X. Comments 

Our investigation indicates that the new method presented in 
this article is interesting and promising from the mathematical 
point of view, as well as the practical one. Even though we 
were able to find the asymptotics formulas as well as close 
form expressions for invcov^, and more generally for fcov^, 
both estimates for E -1 and /(E), there is a natural question 
still unanswered. How to find the optimum LI For instance 
assume we know that the matrix E is an M x M Toeplitz 
distributed with entries Ejj = exp(— 4|i — j\) with unknown 
j3. How does the optimum L depends on M and Nl In the 
simulations presented in Section IX we see that for M — 60, 
N = 30 the optimum L is equal to 20 and it seems to be 
independent on the value of j3. We believe that this a question 
interesting to explore. 



Fig. 2. Comparison of the eigenvalues of the true covariance matrix and 
sample covariance matrix vs. invcov estimate 
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Fig. 3. Comparison of the eigenvalues of the true covariance matrix and 
sample covariance matrix vs. invcov estimate 
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Fig. 4. Comparison of the eigenvalues of the true covariance matrix and 
sample covariance matrix vs. invcov estimate 
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Fig. 5. Performance comparison with respect to Ledoit and Wolf for f} = 10. 
The horizontal blue line is the performance of the LW estimate, our invcov 
estimate outperforms LW for L between 15 and 22. 



invcov vs Ledoit-Wolf : estimation for JT 
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Fig. 7. Performance comparison with respect to Ledoit and Wolf for = 100 
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Average over the permutation matrices 
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In this Appendix we study what happen if we substitute the 
average over the isotropically random unitary matrices by the 
set of permutation matrices. Let D be an N x N diagonal 
matrix with D = diag(<ii, . . . , djv) and L < N. We will 
denote e p the orthonormal vector 1 x 7Y in C N whose entries 
are zero except at the p t/l -coordinate that is 1 . A permutation 
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matrix is a L x TV matrix of the form 

<E> = . where pi 7^ pj for i ^ j. 



E 



Theorem 2: Let D be a diagonal matrix as before. Then 

L 



i $ : $ L x N permutation 



TV 



Proof: We will first show that 

$D$* = diag(d Pl ,dp 2) . . . ,d PL ). 
Given 1 < p < N then e p ■ D = d p e p . Therefore, 



/ d pi e Pl \ 

^P2 e P2 



( 



T T 

P P 

C Pl C P2 



Hence, 
Thus, 



\ d PL e PL J 
= diag(d Pl ,d P2 ,...,d PL ). 

($WT 1 = diag( d" 1 , d" 1 , . . . , <). 

L 

$*($D$*)- 1 $ = ^d- 1 e p :e w . 



(84) 



Note that £ p := eje p is the matrix that has all entries zero 
except at the (p,p) entry which is 1. The total number of 
different permutation matrices is ^ N m L y^ ■ Therefore, 



(N-L)\ 
AM 



E 

(PlviPL) 



1=1 



where the first sum is running over all the possible permutation 
matrices. The number of permutation matrices that have e p as 



one of their rows is n^TWjv^W • Hence, 



(N-L)\ (N-1)\L\ 
Wl (L-1)\(N-L) 



N 



D-\ 
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